Regression Models

STA 101 - Summer I 2022

Raphael Morsomme

Welcome

Announcements

  • Assignments

Recap of last lecture

  • Histogram, scatterplot, boxplot
  • Average, median, variance, sd and IQR; robustness
  • Frequency, contingency and proportion tables
  • Barplot, mosaic plot
  • Effective communication: well-edited figures, \(\ge3\) variables (symbols, colors, facets)
  • R for Data Science - chapters 3 and 7

Simple linear regression

MPG

d <- ggplot2::mpg
head(d)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

Imagine that we do not have the variable hwy (fuel efficiency on the highway) and that it is expensive to measure.

We decide to estimate it from the variable cty; after all, cty should be a good proxy for hwy.

Linear regression

ggplot(d) +
  geom_point(aes(x = cty, y = hwy))

Although cty is not a perfect predictor of hwy, it does a pretty good job.

Let us capture this with the following linear regression model \[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} \] where the numbers \(\beta_0\) and \(\beta_1\) are the two parameters of the model.

Good parameters

We want to find values for \(\beta_0\) and \(\beta_1\) that give a good model.

Let us look at two sets of values:

  • \(\beta_0 = 1\), \(\beta_1 = 1.3\): \[ \text{hwy} \approx 1 + 1.3\,\text{cty} \]
  • \(\beta_0 = 15\), \(\beta_1 = 0.5\): \[ \text{hwy} \approx 15 + 0.5\,\text{cty} \]

ggplot(d) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(intercept = 1, slope = 1.3)

ggplot(d) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(intercept = 15, slope = 0.5)
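Beyond eyeballing the two lines, we can compare them numerically by measuring how far each line's predictions fall from the observed hwy values. A minimal sketch (the mean squared error here is an illustrative choice; the course formalizes model evaluation later):

```r
library(ggplot2)  # for the built-in mpg data

d <- mpg

# Mean squared prediction error of each candidate line (smaller = better fit)
mse_1 <- mean((d$hwy - (1  + 1.3 * d$cty))^2)
mse_2 <- mean((d$hwy - (15 + 0.5 * d$cty))^2)

c(mse_1, mse_2)
```

Whichever candidate produces the smaller average squared error tracks the point cloud more closely.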

Uninsurance and high school graduation rates in NC

Data source

  • The data come from usdata::county_2019
  • These data have been compiled from the 2019 American Community Survey
# data prop for mapping
dfips <- maps::county.fips %>%
  as_tibble() %>% 
  extract(polyname, c("region", "subregion"), "^([^,]+),([^,]+)$") %>%
  filter(region %in% c("north carolina", "new york"))

map_county_2019 <- map_data("county") %>%
  as_tibble() %>%
  filter(region %in% c("north carolina", "new york")) %>%
  left_join(dfips) %>%
  mutate(fips = if_else(
    subregion == "currituck" & region == "north carolina", 37053L, fips
  )) %>%
  left_join(county_2019, by = "fips")

map_county_2019_nc <- map_county_2019 %>%
  filter(state == "North Carolina")

Uninsurance rate

ggplot(map_county_2019_nc, 
       aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = uninsured)) +
  scale_fill_viridis_c(option = "E", labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = NULL, y = NULL, fill = NULL,
    title = "Percent uninsured (2015 - 2019)",
    subtitle = "Civilian noninstitutionalized population in NC"
  ) +
  coord_quickmap(clip = "off") +
  theme_void() +
  theme(
    legend.direction = "horizontal",
    legend.position = c(0.2, 0.1),
  ) +
  geom_polygon(data = map_county_2019_nc %>% filter(subregion == "durham"), aes(fill = uninsured), color = "white") +
  annotate("text", x = -78.7, y = 36.3, label = "Durham County\n(12%)", hjust = 0, size = 4, color = "white", fontface = "bold") +
  geom_polygon(data = map_county_2019_nc %>% filter(subregion == "swain"), aes(fill = uninsured), color = "black") +
  annotate("text", x = -83.4, y = 35.9, label = "Swain County\n(21.5%)", hjust = 1, size = 4, fontface = "bold") +
  geom_polygon(data = map_county_2019_nc %>% filter(subregion == "gates"), aes(fill = uninsured), color = "black") +
  annotate("text", x = -76.9, y = 36.8, label = "Gates County\n(6.6%)", hjust = 0, size = 4, color = "black", fontface = "bold")

High school graduation rate

map_county_2019 %>%
  filter(state == "North Carolina") %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = hs_grad)) +
  scale_fill_viridis_c(option = "D", labels = label_percent(scale = 1, accuracy = 1)) +
  labs(x = NULL, y = NULL, fill = NULL,
    title = "Percent high school graduate (2015 - 2019)",
    subtitle = "25 and older population in NC") +
  coord_quickmap(clip = "off") +
  theme_void() +
  theme(
    legend.direction = "horizontal",
    legend.position = c(0.2, 0.1),
  ) +
  geom_polygon(data = map_county_2019_nc %>% filter(subregion == "durham"), aes(fill = hs_grad), color = "white") +
  annotate("text", x = -78.7, y = 36.3, label = "Durham County\n(88.4%)", hjust = 0, size = 4, color = "white", fontface = "bold") +
  geom_polygon(data = map_county_2019_nc %>% filter(subregion == "tyrrell"), aes(fill = hs_grad), color = "black") +
  annotate("text", x = -76.4, y = 36.3, label = "Tyrrell\nCounty\n(74%)", hjust = 0, size = 4, color = "black", fontface = "bold") +
  geom_polygon(data = map_county_2019_nc %>% filter(subregion == "dare"), aes(fill = hs_grad), color = "black") +
  annotate("text", x = -75.9, y = 35.2, label = "Dare\nCounty\n(94.2%)", hjust = 0, size = 4, color = "black", fontface = "bold")

Examining the relationship

  • The NC Labor and Economic Analysis Division (LEAD) “administers and collects data, conducts research, and publishes information on the state’s economy, labor force, educational, and workforce-related issues”.
  • Suppose that an analyst working for LEAD is interested in the relationship between uninsurance and high school graduation rates in NC counties.

What type of visualization should the analyst make to examine the relationship between these two variables?

Data prep

county_2019_nc <- county_2019 %>%
  as_tibble() %>%
  filter(state == "North Carolina") %>%
  select(name, hs_grad, uninsured)

county_2019_nc
# A tibble: 100 x 3
   name             hs_grad uninsured
   <chr>              <dbl>     <dbl>
 1 Alamance County     86.3      11.2
 2 Alexander County    82.4       8.9
 3 Alleghany County    77.5      11.3
 4 Anson County        80.7      11.1
 5 Ashe County         85.1      12.6
 6 Avery County        83.6      15.9
 7 Beaufort County     87.7      12  
 8 Bertie County       78.4      11.9
 9 Bladen County       81.3      12.9
10 Brunswick County    91.3       9.8
# ... with 90 more rows

Uninsurance vs. HS graduation rates

Show the code
ggplot(county_2019_nc,
       aes(x = hs_grad, y = uninsured)) +
  geom_point() +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Uninsurance vs. HS graduation rates",
    subtitle = "North Carolina counties, 2015 - 2019"
  ) +
  geom_point(data = county_2019_nc %>% filter(name == "Durham County"), aes(x = hs_grad, y = uninsured), shape = "circle open", color = "#8F2D56", size = 4, stroke = 2) +
  geom_text(data = county_2019_nc %>% filter(name == "Durham County"), aes(x = hs_grad, y = uninsured, label = name), color = "#8F2D56", fontface = "bold", nudge_y = 3, nudge_x = 2)

Modeling the relationship

Show the code
ggplot(county_2019_nc, aes(x = hs_grad, y = uninsured)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "#8F2D56") +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Uninsurance vs. HS graduation rates",
    subtitle = "North Carolina counties, 2015 - 2019"
  )

Fitting the model

With fit():

nc_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(uninsured ~ hs_grad, data = county_2019_nc)

tidy(nc_fit)
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   33.9      3.99        8.50 2.12e-13
2 hs_grad       -0.262    0.0468     -5.61 1.88e- 7

Augmenting the data

With augment() to add columns for predicted values (.fitted), residuals (.resid), etc.:

nc_aug <- augment(nc_fit$fit)
nc_aug
# A tibble: 100 x 8
   uninsured hs_grad .fitted  .resid   .hat .sigma    .cooksd .std.resid
       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
 1      11.2    86.3   11.3  -0.0633 0.0107   2.10 0.00000501    -0.0305
 2       8.9    82.4   12.3  -3.39   0.0138   2.07 0.0186        -1.63  
 3      11.3    77.5   13.6  -2.27   0.0393   2.09 0.0252        -1.11  
 4      11.1    80.7   12.7  -1.63   0.0199   2.09 0.00633       -0.790 
 5      12.6    85.1   11.6   1.02   0.0100   2.10 0.00122        0.492 
 6      15.9    83.6   12.0   3.93   0.0112   2.06 0.0203         1.89  
 7      12      87.7   10.9   1.10   0.0133   2.10 0.00191        0.532 
 8      11.9    78.4   13.3  -1.44   0.0328   2.09 0.00830       -0.700 
 9      12.9    81.3   12.6   0.324  0.0174   2.10 0.000218       0.157 
10       9.8    91.3    9.95 -0.151  0.0291   2.10 0.0000806     -0.0734
# ... with 90 more rows
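As a sanity check on augment()'s output, every observed value decomposes exactly into fitted value plus residual. A minimal sketch using the built-in mpg data (the same identity holds row by row in nc_aug):

```r
library(broom)

# Fit a simple linear regression on data that ships with ggplot2
fit <- lm(hwy ~ cty, data = ggplot2::mpg)
aug <- augment(fit)

# observed = fitted + residual, for every row
all.equal(aug$hwy, aug$.fitted + aug$.resid)  # TRUE
```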

Visualizing the model I

  • Black circles: Observed values (y = uninsured)
p_nc_aug_base <- ggplot(nc_aug, aes(x = hs_grad)) +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(x = "High school graduate", y = "Uninsured")

p_nc_aug_base +
  geom_point(aes(y = uninsured))

Visualizing the model II

  • Black circles: Observed values (y = uninsured)
  • Pink solid line: Least squares regression line
p_nc_aug_base +
  geom_point(aes(y = uninsured)) +
  geom_smooth(aes(y = uninsured), method = "lm", se = FALSE, color = "pink")

Visualizing the model III

  • Black circles: Observed values (y = uninsured)
  • Pink solid line: Least squares regression line
  • Maroon triangles: Predicted values (y = .fitted)
p_nc_aug_base +
  geom_point(aes(y = uninsured)) +
  geom_smooth(aes(y = uninsured), method = "lm", se = FALSE, color = "pink") +
  geom_point(aes(y = .fitted), color = "maroon", shape = "triangle", size = 2)

Visualizing the model IV

  • Black circles: Observed values (y = uninsured)
  • Pink solid line: Least squares regression line
  • Maroon triangles: Predicted values (y = .fitted)
  • Gray dashed lines: Residuals
p_nc_aug_base +
  geom_segment(aes(xend = hs_grad, y = uninsured, yend = .fitted), size = 0.3, linetype = "dashed", color = "gray20") +
  geom_point(aes(y = uninsured)) +
  geom_smooth(aes(y = uninsured), method = "lm", se = FALSE, color = "pink") +
  geom_point(aes(y = .fitted), color = "maroon", shape = "triangle", size = 2)

Evaluating the model fit

How can we evaluate whether the model for predicting uninsurance rate from high school graduation rate for NC counties is a good fit?

Model evaluation

Two statistics

  • R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor)

    \[ R^2 = Cor(x,y)^2 = Cor(y, \hat{y})^2 \]

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

    \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?
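Both statistics can also be computed directly from their definitions. A minimal sketch using the mpg data (the same two lines apply to the uninsured and .fitted columns of nc_aug):

```r
fit   <- lm(hwy ~ cty, data = ggplot2::mpg)
y     <- ggplot2::mpg$hwy   # observed outcome
y_hat <- fitted(fit)        # predicted outcome

cor(y, y_hat)^2            # R-squared; matches summary(fit)$r.squared
sqrt(mean((y - y_hat)^2))  # RMSE, in the same units as hwy
```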

R-squared

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)

  • Unitless

  • Calculate with rsq():

    rsq(nc_aug, truth = uninsured, estimate = .fitted)
    # A tibble: 1 x 3
      .metric .estimator .estimate
      <chr>   <chr>          <dbl>
    1 rsq     standard       0.243

Interpreting R-squared

nc_fit_rsq <- round(glance(nc_fit)$r.squared * 100, 1)

🗳️ Vote on Slack

The \(R^2\) of the model for predicting uninsurance rate from high school graduation rate for NC counties is 24.3%. Which of the following is the correct interpretation of this value?

  • High school graduation rates correctly predict 24.3% of uninsurance rates in NC counties.
  • 24.3% of the variability in uninsurance rates in NC counties can be explained by high school graduation rates.
  • 24.3% of the variability in high school graduation rates in NC counties can be explained by uninsurance rates.
  • 24.3% of the time uninsurance rates in NC counties can be predicted by high school graduation rates.

Alternative approach for R-squared

Alternatively, use glance() to construct a single row summary of the model fit, including \(R^2\):

glance(nc_fit)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.243         0.235  2.09      31.5 0.000000188     1  -214.  435.  443.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(nc_fit)$r.squared
[1] 0.2430694

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the outcome variable

  • Calculate with rmse():

    rmse(nc_aug, truth = uninsured, estimate = .fitted)
    # A tibble: 1 x 3
      .metric .estimator .estimate
      <chr>   <chr>          <dbl>
    1 rmse    standard        2.07
  • The value of RMSE is not very meaningful on its own, but it’s useful for comparing across models (more on this when we get to regression with multiple predictors)

Obtaining R-squared and RMSE

  • Use rsq() and rmse(), respectively

    rsq(nc_aug, truth = uninsured, estimate = .fitted)
    rmse(nc_aug, truth = uninsured, estimate = .fitted)
  • First argument: data frame containing truth and estimate columns

  • Second argument: name of the column containing truth (observed outcome)

  • Third argument: name of the column containing estimate (predicted outcome)

Purpose of model evaluation

  • \(R^2\) tells us how well our model predicts the data we already have
  • But generally we are interested in prediction for a new observation, not for one that is already in our sample, i.e. out-of-sample prediction
  • We have a couple ways of simulating out-of-sample prediction before actually getting new data to evaluate the performance of our models

Splitting data

Spending our data

  • There are several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.
  • Doing all of this on the entire data we have available leaves us with no other data to assess our choices
  • We can allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we’ve done so far)
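In the tidymodels framework, this allocation is typically done with the rsample package; a minimal sketch, assuming rsample is installed and using the mpg data for illustration:

```r
library(rsample)

set.seed(123)
split <- initial_split(ggplot2::mpg, prop = 0.9)  # 90% training, 10% testing
train <- training(split)
test  <- testing(split)

# Estimate parameters on the training set only ...
fit <- lm(hwy ~ cty, data = train)

# ... and assess performance on data the model has never seen
preds <- predict(fit, newdata = test)
```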

Simulation: data splitting

  • Take a random sample of 10% of the data and set it aside (testing data)
  • Fit a model on the remaining 90% of the data (training data)
  • Use the coefficients from this model to make predictions for the testing data
  • Repeat 10 times


set.seed(345)

n_folds <- 10

county_2019_nc_folds <- county_2019_nc %>%
  slice_sample(n = nrow(county_2019_nc)) %>%
  mutate(fold = rep(1:n_folds, n_folds)) %>%
  arrange(fold)

predict_folds <- function(i) {
  fit <- lm(uninsured ~ hs_grad, data = county_2019_nc_folds %>% filter(fold != i))
  predict(fit, newdata = county_2019_nc_folds %>% filter(fold == i)) %>%
    bind_cols(county_2019_nc_folds %>% filter(fold == i), .fitted = .)
}

nc_fits <- map_df(1:n_folds, predict_folds)

p_nc_fits <- ggplot(nc_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
  geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Predicted uninsurance rate in NC",
    subtitle = glue("For {n_folds} different testing datasets")
    )

p_nc_fits


Predictive performance

  • How consistent are the predictions for different testing datasets?
  • How consistent are the predictions for counties with high school graduation rates in the middle of the plot vs. at the edges?
p_nc_fits

Bootstrapping

Bootstrapping our data

  • The idea behind bootstrapping is that if a given observation exists in a sample, there may be more like it in the population
  • With bootstrapping, we simulate resampling from the population by resampling from the sample we observed
  • Bootstrap samples are drawn with replacement from the original sample and have the same size as the original sample
    • For example, if our sample consists of the observations {A, B, C}, bootstrap samples could be {A, A, B}, {A, C, A}, {B, C, C}, {A, B, C}, etc.
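The {A, B, C} example above can be sketched directly with base R's sample() function, drawing with replacement and keeping the original sample size:

```r
set.seed(42)
s <- c("A", "B", "C")

# Each column is one bootstrap sample: same size as s, drawn with replacement,
# so observations can repeat within a sample
boots <- replicate(4, sample(s, size = length(s), replace = TRUE))
boots
```

Running this a few times shows samples like {A, A, B} or {B, C, C}, matching the examples above.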

Simulation: bootstrapping

  • Take a bootstrap sample – sample with replacement from the original data, same size as the original data
  • Fit model to the sample and make predictions for that sample
  • Repeat many times
n_boot <- 100

predict_boots <- function(i){
  boot <- county_2019_nc %>%
    slice_sample(n = nrow(county_2019_nc), replace = TRUE) %>%
    mutate(boot_samp = i)
  fit <- lm(uninsured ~ hs_grad, data = boot)
  predict(fit) %>% bind_cols(boot, .fitted = .)
}

set.seed(1234)
county_2019_nc_boots <- map_df(1:n_boot, predict_boots)

p_nc_boots <- ggplot(county_2019_nc_boots, aes(x = hs_grad, y = .fitted, group = boot_samp)) +
  geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Predicted uninsurance rate in NC",
    subtitle = glue("For {n_boot} bootstrap samples")
    )

p_nc_boots

Predictive performance

  • How consistent are the predictions for different bootstrap datasets?
  • How consistent are the predictions for counties with high school graduation rates in the middle of the plot vs. at the edges?
p_nc_boots